This is something new: some packages need quite large files to work well. When you run the code below, it might take a while (say, a minute). That's because textstem will be downloading some files it needs to function properly.
About halfway down the notebook there will be another moment like that, around "We start by downloading a language model". There it will take more like 5 minutes to download the file. Oh well… at least you only need to do this once ever!
WHEN YOU RUN THE CODE BELOW YOU MIGHT BE ASKED IF YOU’D LIKE TO RELOAD RSTUDIO: ANSWER ‘NO’
# for all the packages we need:
# install.packages("pacman")  # uncomment and run once if pacman is not installed
pacman::p_load(dplyr, stringr, udpipe, lattice, tidytext, readr, SnowballC, textstem)
Let’s get right to work. We’ll start by loading the data:
# Load the CSV into a data frame
file_path <- "./data/CORONA_TWEETS.csv"
Corona_NLP_DF <- read.csv(file_path)
# Convert it into a tibble
mydata_TB <- as_tibble(Corona_NLP_DF)
From the video you already know what a stem is, and why it is important. Now let's look at the code that produces stems.
We will use the SnowballC package to carry out stemming. We will tokenise our text column into words and apply the stemmer wordStem. Have a look at the tibble it produces and compare the word and stem columns.
You can see more options using help(wordStem). For example, you can change the language and call different stemmers.
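As a quick illustration of those options (a minimal sketch, assuming SnowballC is installed), getStemLanguages() lists the supported languages, and the language argument switches between language-specific stemmers:

```r
library(SnowballC)

# List the languages the Snowball stemmers support
getStemLanguages()

# The default is the original Porter stemmer; you can also ask for a
# language-specific stemmer explicitly:
wordStem(c("running", "runs", "easily"), language = "english")
wordStem(c("corriendo", "corre"), language = "spanish")
```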
library(SnowballC)
mydata_TB %>%
unnest_tokens(output = word, input= text) %>%
mutate(stem = wordStem(word))
# have a look at the first few pages of the result below (especially the word and stem columns). Are any of them surprising? Why do you think they are this way? Would you have stemmed them differently?
And here's an example of applying stemming to a simple string. Notice that since stemming is applied to individual words, we will have to split the string into words, get the stems, then stitch the sentence back together. It's a bit crude, but it shows what we can do with stems.
my_text_about_cats <- c("My cat is tired today. She is sleeping on her mat in the sun. She is dreaming about running and eating now. She usually wakes when she is hungry.")
stems <- my_text_about_cats %>%
strsplit("\\s+") %>%
unlist() %>%
wordStem() %>%
paste(collapse = " ")
stems
## [1] "My cat i tire today. She i sleep on her mat in the sun. She i dream about run and eat now. She usual wake when she i hungry."
Wouldn't it be nice if the library could do the splitting into words for us? Of course! This is R, so obviously there is a function for that! In fact, there are two functions we can use here: stem_words() and stem_strings().
Notice an important difference between those two methods:
stem_words() expects a vector of words, and will find the stem of each item in that vector. So you usually have to do the work of preparing the data, but you also have more control over the result.
c("She", "is", "dreaming", "about", "running", "and", "eating", "now") %>%
stem_words()
## [1] "She" "i" "dream" "about" "run" "and" "eat" "now"
stem_strings() is more forgiving, as it expects whole strings (e.g. sentences) and will stem each word in each of those sentences. But the result is a sentence, which gives you less fine-grained control.
my_text_about_cats %>%
stem_strings()
## [1] "My cat i tire todai. She i sleep on her mat in the sun. She i dream about run and eat now. She usual wake when she i hungri."
Also from the video you know what lemmas are. Now let's see them in code.
Here we use the textstem package to produce lemmas. (This package will also do stemming.) Just like in the examples above, there are two functions we'll use: one expects a vector of words, and one expects a vector of more complex strings (which it will split into words by itself).
Below are some interesting examples:
# btw. depending on your R environment, you might need the syuzhet package
if (!require("syuzhet")) {
  install.packages("syuzhet")
}
## Loading required package: syuzhet
## Warning: package 'syuzhet' was built under R version 4.4.3
library(syuzhet)
library(textstem)
vector <- c("run", "ran", "running", "walked", "walks", "walking")
lemmatize_words(vector) # takes collection of words!
## [1] "run" "run" "run" "walk" "walk" "walk"
Luckily for us, lemmatize_strings can be applied to whole sentences (and will get the lemma of each individual word by itself).
my_text_about_cats <- c("My cat is tired today. She is sleeping on her mat in the sun. She is dreaming about running and eating now. She usually wakes when she is hungry.")
lemmatize_strings(my_text_about_cats) # takes whole sentences (and can take many)
## [1] "My cat be tire today. She be sleep on her mat in the sun. She be dream about run and eat now. She usually wake when she be hungry."
Question to ponder: what would happen if you put a whole sentence into lemmatize_words? And why?
Try using this package to do stemming and compare with the results from above. Take the bits of code from above that process a whole string, and reduce that string to stems and to lemmas. Compare the differences. What do you see? Does it make sense in the context of the video you've seen?
my_text_about_cats <- c("My cat is tired today. She is sleeping on her mat in the sun. She is dreaming about running and eating now. She usually wakes when she is hungry.")
# bring pieces of code from above here: one that will turn it into lemmas, one that will turn it into stems. What differences in output do you see?
my_text_about_cats <- c("My cat is tired today. She is sleeping on her mat in the sun. She is dreaming about running and eating now. She usually wakes when she is hungry.")
# your code here
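If you get stuck, here is one possible approach (a minimal sketch using the functions shown above; your own solution may differ):

```r
library(SnowballC)
library(textstem)

my_text_about_cats <- c("My cat is tired today. She is sleeping on her mat in the sun. She is dreaming about running and eating now. She usually wakes when she is hungry.")

# Stems: stem_strings() splits the string into words and stems each one
stem_strings(my_text_about_cats)

# Lemmas: lemmatize_strings() does the same, but looks words up in a lexicon
lemmatize_strings(my_text_about_cats)
```

Comparing the two outputs, stems are often truncated non-words (e.g. "usual", "hungri"), while lemmas are real dictionary forms (e.g. "be", "run").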
Come back to this later: to understand this better, have a look at the help pages for these functions. You will see that you can call different dictionaries and lexicons to support stemming and lemmatisation, and these change the results you get.
There are many different POS tagging tools available in R. We will use UDPipe, but you may also want to look at the R implementation of openNLP. UDPipe also does lemmatisation and dependency parsing.
To use UDPipe you need a language model. We start by downloading a language model for English; UDPipe also works in other languages. We then create data from our Corona tweets that we want to POS tag. Next we are going to "annotate" that data with POS tags.
UDPipe produces two types of POS tags: upos and xpos. upos is universal part-of-speech tagging, and xpos is language-specific.
#We start by downloading a language model
model_eng_ewt <- udpipe_download_model(language = "english-ewt")
## Downloading udpipe model from https://raw.githubusercontent.com/jwijffels/udpipe.models.ud.2.5/master/inst/udpipe-ud-2.5-191206/english-ewt-ud-2.5-191206.udpipe to F:/Paul Smith/04-r_testing/short-NLP-1-HSC-notes/english-ewt-ud-2.5-191206.udpipe
## - This model has been trained on version 2.5 of data from https://universaldependencies.org
## - The model is distributed under the CC-BY-SA-NC license: https://creativecommons.org/licenses/by-nc-sa/4.0
## - Visit https://github.com/jwijffels/udpipe.models.ud.2.5 for model license details.
## - For a list of all models and their licenses (most models you can download with this package have either a CC-BY-SA or a CC-BY-SA-NC license) read the documentation at ?udpipe_download_model. For building your own models: visit the documentation by typing vignette('udpipe-train', package = 'udpipe')
## Downloading finished, model stored at 'F:/Paul Smith/04-r_testing/short-NLP-1-HSC-notes/english-ewt-ud-2.5-191206.udpipe'
model_eng_ewt_path <- model_eng_ewt$file_model
#To load our downloaded model, use the udpipe_load_model() function:
model_eng_ewt_loaded <- udpipe_load_model(file = model_eng_ewt_path)
# Create a text variable (str_squish removes extra whitespace)
text <- Corona_NLP_DF$text %>% str_squish()
#Annotate data - this may take a moment
text_annotated <- udpipe_annotate(model_eng_ewt_loaded, x = text) %>%
as.data.frame()
# Look at text_annotated - the extra columns upos and xpos give you the POS tags
text_annotated
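To focus on just the POS information (a sketch; it assumes the text_annotated data frame produced by the chunk above), you can select a few of its columns:

```r
# Each row of text_annotated is one token; pick out the columns
# that matter most for POS tagging
head(text_annotated[, c("token", "lemma", "upos", "xpos")], 20)
```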
We can now use the text_annotated data frame to graph the frequencies of POS tags and look at the highest-occurring types. We do this for nouns and adjectives.
# Now you can display the most popular parts of speech:
txt_freq(text_annotated$xpos)
First, read the code below. What does it do? Once you understand it well enough, there is a task waiting for you below, where you will have to change it:
#Look at the frequency of the POS tags
freq <- txt_freq(text_annotated$xpos)
print(freq)
## key freq freq_pct
## 1 NN 28259 19.175544548
## 2 IN 14504 9.841894551
## 3 DT 9000 6.107077424
## 4 NNS 8766 5.948293411
## 5 NNP 8202 5.565583226
## 6 JJ 7792 5.287371921
## 7 . 7171 4.865983579
## 8 VB 6886 4.672592794
## 9 RB 6565 4.454773699
## 10 PRP 5713 3.876637036
## 11 , 5095 3.457284386
## 12 VBP 4284 2.906968854
## 13 CC 4209 2.856076542
## 14 VBG 3780 2.564972518
## 15 CD 3418 2.319332293
## 16 VBZ 2955 2.005157088
## 17 TO 2493 1.691660446
## 18 VBN 2256 1.530840741
## 19 PRP$ 1941 1.317093031
## 20 VBD 1921 1.303521748
## 21 MD 1773 1.203094253
## 22 ADD 1308 0.887561919
## 23 RP 1018 0.690778313
## 24 WRB 699 0.474316347
## 25 HYPH 697 0.472959218
## 26 UH 643 0.436316754
## 27 WP 640 0.434281061
## 28 : 624 0.423424035
## 29 GW 395 0.268032843
## 30 JJR 387 0.262604329
## 31 -RRB- 345 0.234104635
## 32 NNPS 343 0.232747506
## 33 NFP 327 0.221890480
## 34 WDT 318 0.215783402
## 35 JJS 313 0.212390582
## 36 -LRB- 282 0.191355093
## 37 EX 225 0.152676936
## 38 PDT 216 0.146569858
## 39 `` 209 0.141819909
## 40 '' 200 0.135712832
## 41 LS 194 0.131641447
## 42 RBR 185 0.125534369
## 43 POS 179 0.121462984
## 44 $ 178 0.120784420
## 45 SYM 164 0.111284522
## 46 FW 124 0.084141956
## 47 RBS 95 0.064463595
## 48 AFX 76 0.051570876
## 49 WP$ 3 0.002035692
And now let's visualise that data:
# Create barcharts to look at the frequencies of the upos POS tags
freq.distribution.upos <-
txt_freq(text_annotated$upos)
freq.distribution.upos$key <-
factor(freq.distribution.upos$key,
levels = rev(freq.distribution.upos$key))
barchart(
key ~ freq,
data = freq.distribution.upos,
col = "dodgerblue",
main = "UPOS frequencies",
xlab = "Freq"
)
## NOUNS
nouns <- subset(text_annotated, upos %in% c("NOUN"))
nouns <- txt_freq(nouns$token)
nouns$key <- factor(nouns$key, levels =
rev(nouns$key))
barchart(key ~ freq, data = head(nouns, 20),
col ="cadetblue",
main = "Most occurring nouns",
xlab = "Freq")
## ADJECTIVES
adj <- subset(text_annotated, upos %in% c("ADJ"))
adj <- txt_freq(adj$token)
adj$key <- factor(adj$key, levels = rev(adj$key))
barchart(key ~ freq, data = head(adj, 20),
col = "purple",
main = "Most occurring adjectives",
xlab ="Freq")
Now it's your turn. Copy-paste the code above and make changes to it: add another type of POS tag. Copy the block for NOUN or ADJECTIVE and replace the POS tag type with another one.
# code can come here
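If you'd like a starting point, here is what such a block might look like for verbs (a sketch, assuming the text_annotated data frame from above and the lattice package loaded at the start; any other upos tag, e.g. "ADV" or "PROPN", would work the same way):

```r
## VERBS
verbs <- subset(text_annotated, upos %in% c("VERB"))
verbs <- txt_freq(verbs$token)
verbs$key <- factor(verbs$key, levels = rev(verbs$key))
barchart(key ~ freq, data = head(verbs, 20),
         col = "darkorange",
         main = "Most occurring verbs",
         xlab = "Freq")
```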
Now is a good moment to write down your self-reflection: think of 3 STARS (things that you learned in this badge) and 1 WISH (a thing you wish you understood better). You might also think about what you would do to fulfil your wish. Write them down.
Now you've seen lemmas, stems and POS tagging in action. This opens up a whole new world of practical NLP.